FNU Sidharth

Unmixing the Crowd: Learning Mixture-to-Set Speaker Embeddings for Enrollment-Free Target Speech Extraction

Apr 03, 2026

FNU Sidharth, Meysam Asgari, Hao-Wen Dong, Dhruv Jain

Abstract: Personalized or target speech extraction (TSE) typically needs a clean enrollment -- hard to obtain in real-world crowded environments. We remove the essential need for enrollment by predicting, from the mixture itself, a small set of per-speaker embeddings that serve as the control signal for extraction. Our model maps a noisy mixture directly to a small set of candidate speaker embeddings trained to align with a strong single-speaker speaker-embedding space via permutation-invariant teacher supervision. On noisy LibriMix, the resulting embeddings form a structured and clusterable identity space, outperforming WavLM+K-means and separation-derived embeddings in standard clustering metrics. Conditioning these embeddings into multiple extraction back-ends consistently improves objective quality and intelligibility, and generalizes to real DNS-Challenge recordings.

* Submitted to ISCA Interspeech 2026
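
The abstract mentions permutation-invariant teacher supervision: the predicted set of embeddings has no fixed speaker order, so a loss has to be scored under the best assignment of predictions to teacher embeddings. The following is a minimal sketch of that general idea, not the authors' implementation. The function names, the cosine-distance objective, and the brute-force search over permutations are all assumptions made for illustration.

```python
from itertools import permutations
import math


def cosine(u, v, eps=1e-12):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)


def pit_cosine_loss(predicted, teacher):
    """Hypothetical permutation-invariant loss (illustrative only).

    Tries every assignment of predicted embeddings to teacher embeddings
    and returns the lowest mean cosine distance with its permutation.
    perm[i] is the index of the prediction matched to teacher[i].
    """
    n = len(teacher)
    best_loss, best_perm = float("inf"), None
    for perm in permutations(range(n)):
        loss = sum(
            1.0 - cosine(predicted[perm[i]], teacher[i]) for i in range(n)
        ) / n
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Two speakers whose predictions come out in the opposite order.
predicted = [[0.0, 1.0], [1.0, 0.0]]
teacher = [[1.0, 0.0], [0.0, 1.0]]
loss, perm = pit_cosine_loss(predicted, teacher)
```

Brute-force search grows factorially with the number of speakers, so it is only practical for the small sets the abstract describes; larger sets would usually call for an assignment solver such as the Hungarian algorithm.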

Real-time Speech Enhancement on Raw Signals with Deep State-space Modeling

Sep 07, 2024

Yan Ru Pei, Ritik Shrivastava, FNU Sidharth

Abstract: We present aTENNuate, a simple deep state-space autoencoder configured for efficient online raw speech enhancement in an end-to-end fashion. The network's performance is primarily evaluated on raw speech denoising, with additional assessments on tasks such as super-resolution and de-quantization. We benchmark aTENNuate on the VoiceBank + DEMAND and the Microsoft DNS1 synthetic test sets. The network outperforms previous real-time denoising models in terms of PESQ score, parameter count, MACs, and latency. Even as a raw waveform processing model, the model maintains high fidelity to the clean signal with minimal audible artifacts. In addition, the model remains performant even when the noisy input is compressed down to 4000Hz and 4 bits, suggesting general speech enhancement capabilities in low-resource environments. Code is available at github.com/Brainchip-Inc/aTENNuate

* 7 pages, 2 figures
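
To make the "4000Hz and 4 bits" input condition concrete, here is a rough sketch of how a waveform could be degraded in that way. It is not taken from the aTENNuate code base. The helper names, the 16 kHz source rate implied by the factor of 4, the naive decimation without a low-pass filter, and the mid-rise uniform quantizer are all assumptions for illustration.

```python
def decimate(samples, factor=4):
    """Keep every `factor`-th sample (e.g. an assumed 16 kHz -> 4 kHz).

    Naive on purpose: a real pipeline would low-pass filter first
    to limit aliasing.
    """
    return samples[::factor]


def quantize(samples, bits=4):
    """Uniform mid-rise quantizer for samples in [-1.0, 1.0].

    With bits=4 every sample is mapped to one of 16 reconstruction levels.
    """
    levels = 2 ** bits
    step = 2.0 / levels
    out = []
    for x in samples:
        x = max(-1.0, min(1.0, x))  # clip to the valid range
        index = min(int((x + 1.0) / step), levels - 1)
        out.append(-1.0 + (index + 0.5) * step)
    return out


# Degrade a short ramp signal: fewer samples, then coarser amplitudes.
ramp = [i / 50.0 - 1.0 for i in range(101)]
degraded = quantize(decimate(ramp, factor=4), bits=4)
```

A de-quantization or super-resolution model would then be asked to recover something close to the original waveform from `degraded`.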

Raw Speech Enhancement with Deep State Space Modeling

Sep 05, 2024

Yan Ru Pei, Ritik Shrivastava, FNU Sidharth

* 7 pages, 2 figures